12 research outputs found

    Visibility rendering order: Improving energy efficiency on mobile GPUs through frame coherence

    During real-time graphics rendering, the GPU processes objects in the order in which the CPU submits them, and occluded surfaces are often processed even though they will not be part of the final image, which wastes time and energy. To help discard occluded surfaces, most current GPUs include an Early-Depth test before the fragment processing stage. However, to be effective, it requires opaque objects to be processed in front-to-back order. Depth sorting and other object-level occlusion culling techniques incur overheads that are only offset in applications with substantial depth and/or fragment shading complexity, which is often not the case in mobile workloads. We propose a novel architectural technique for mobile GPUs, Visibility Rendering Order (VRO), which reorders objects front-to-back entirely in hardware by exploiting the fact that objects in animated graphics applications tend to keep their relative depth order across consecutive frames (temporal coherence). Since order relationships are already tested by the Depth Test, VRO incurs minimal energy overhead: it only requires adding a small hardware unit that captures that information and uses it later to guide the rendering of the following frame. Moreover, unlike other approaches, this unit works in parallel with the graphics pipeline without any performance overhead. We illustrate the benefits of VRO using various unmodified commercial 3D applications, for which VRO achieves a 27% speed-up and a 14.8% energy reduction on average over a state-of-the-art mobile GPU.
    Peer reviewed. Postprint (author's final draft).
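To make the role of submission order concrete, here is a minimal Python sketch of a single pixel under an early depth test. It is a software toy, not the hardware described in the abstract: the object names and depth values are made up, and sorting by the previous frame's depths is only an assumed stand-in for whatever order information the VRO unit captures from the Depth Test.

```python
def shaded_fragments(depths):
    """Count fragments that pass an early depth test at one pixel.

    `depths` lists the depth of each opaque fragment in submission order;
    a smaller value means closer to the camera.
    """
    nearest = float("inf")  # depth buffer cleared to "far"
    shaded = 0
    for depth in depths:
        if depth < nearest:  # test passes, so the fragment gets shaded
            nearest = depth
            shaded += 1
    return shaded


# Hypothetical depths of three objects at one pixel in two consecutive frames.
prev_frame = {"sky": 0.90, "wall": 0.50, "player": 0.20}
next_frame = {"sky": 0.90, "wall": 0.48, "player": 0.22}  # same relative order

submitted = ["sky", "wall", "player"]              # back-to-front, as submitted
reordered = sorted(submitted, key=prev_frame.get)  # order learned from the previous frame

baseline = shaded_fragments([next_frame[name] for name in submitted])
with_reorder = shaded_fragments([next_frame[name] for name in reordered])
print(baseline, with_reorder)
```

In this toy case the back-to-front submission shades all three fragments, while the order taken from the previous frame shades only the visible one, because the relative depth order did not change between frames.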

    Dynamic sampling rate: harnessing frame coherence in graphics applications for energy-efficient GPUs

    In real-time rendering, a 3D scene is modelled with meshes of triangles that the GPU projects onto the screen. The triangles are discretized by sampling each one at regular spatial intervals to generate fragments, to which a shader program then adds texture and lighting effects. Realistic scenes require detailed geometric models, complex shaders, high-resolution displays and high screen refresh rates, all of which come at a great cost in compute time and energy. This cost is often dominated by the fragment shader, which runs for each sampled fragment. Conventional GPUs sample the triangles once per pixel; however, many screen regions contain little variation, produce identical fragments, and could be sampled at lower than pixel rate with no loss in quality. Additionally, because temporal frame coherence makes consecutive frames very similar, the level of variation in each region usually carries over from frame to frame. This work proposes Dynamic Sampling Rate (DSR), a novel hardware mechanism to reduce redundancy and improve energy efficiency in graphics applications. DSR analyzes the spatial frequencies of the scene once it has been rendered. Then, it leverages the temporal coherence between consecutive frames to decide, for each region of the screen, the lowest sampling rate for the next frame that maintains image quality. We evaluate the performance of a state-of-the-art mobile GPU architecture extended with DSR for a wide variety of applications. Experimental results show that DSR removes most of the redundancy inherent in the color computations at fragment granularity, which brings average speedups of 1.68x and energy savings of 40%.
    This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (Grant No. 833057), the Spanish State Research Agency (MCIN/AEI) under Grant PID2020-113172RB-I00, the ICREA Academia program, and the Generalitat de Catalunya under Grant FI-DGR 2016. Funding was provided by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. TIN2016-75344-R).
    Peer reviewed. Postprint (published version).
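The per-region decision described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DSR hardware: the tile size, the threshold, and the use of a simple max-minus-min range in place of a real spatial-frequency analysis are all choices made here for brevity, and the frame is a made-up 4x4 grid of grayscale values.

```python
def choose_steps(frame, tile=2, threshold=0.0):
    """Pick a sampling step for each tile from an already rendered frame.

    A step of 1 means one sample per pixel; a step of `tile` means one
    sample for the whole tile. Assumes the frame dimensions are multiples
    of `tile`.
    """
    steps = {}
    for row in range(0, len(frame), tile):
        for col in range(0, len(frame[0]), tile):
            block = [frame[y][x]
                     for y in range(row, row + tile)
                     for x in range(col, col + tile)]
            variation = max(block) - min(block)  # crude stand-in for frequency analysis
            steps[(row // tile, col // tile)] = tile if variation <= threshold else 1
    return steps


def shader_invocations(steps, tile=2):
    """Fragment shader runs needed for the next frame under the chosen steps."""
    return sum(1 if step == tile else tile * tile for step in steps.values())


# Hypothetical rendered frame: only the top-right tile has any variation.
previous = [
    [0.2, 0.2, 0.1, 0.9],
    [0.2, 0.2, 0.5, 0.3],
    [0.7, 0.7, 0.7, 0.7],
    [0.7, 0.7, 0.7, 0.7],
]
steps = choose_steps(previous)
print(steps, shader_invocations(steps), "of", len(previous) * len(previous[0]))
```

Here three of the four tiles are uniform, so the next frame would need 7 shader invocations instead of 16 if the regions stay as they were.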

    Improving the energy efficiency of the graphics pipeline by reducing overshading

    The most common task of GPUs is to render images in real time. When rendering a 3D scene, a key step is determining which parts of every object are visible in the final image. There are different approaches to solving the visibility problem, the Z-Test being the most common in modern GPUs. A main factor that significantly penalizes the energy efficiency of a GPU, especially in the mobile arena, is so-called overshading, which happens when a portion of an object is shaded and rendered but is finally occluded by another object. This useless work wastes energy, yet the conventional Z-Test eliminates only a fraction of it. In this paper we present a novel microarchitectural technique, the Ω-Test, to drastically reduce overshading on a Tile-Based Rendering (TBR) architecture. The proposed approach leverages frame-to-frame coherence by taking advantage of the costly and valuable calculations made in previous frames. In particular, we propose to reuse information from the Z-Buffer of the previous frame, which is currently discarded. We observe that, due to frame-to-frame coherence, the Z-Buffer of a frame is highly similar in many areas to that of the previous frame. As a result, the proposed technique avoids many costly computations and off-chip memory accesses. Our experimental evaluation shows that the Ω-Test reduces the energy consumption of the overall GPU/Memory system by 15.7% and the runtime of the evaluated benchmarks by 10.6% on average.
    This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency under grant TIN2016-75344-R (AEI/FEDER, EU) and the ICREA Academia program. D. Corbalán-Navarro has been supported by a PhD research fellowship from the University of Murcia.
    Peer reviewed. Postprint (author's final draft).
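Overshading at one pixel, and the effect of starting from the previous frame's Z-Buffer, can be illustrated with a small Python sketch. The abstract does not describe how the technique stays correct when the prediction is wrong, so the code below only shows the idea and its failure case; the function name and depth values are hypothetical.

```python
FAR = 1.0  # depth value of a cleared Z-Buffer


def shade_count(fragments, initial_depth):
    """Fragments shaded at one pixel when the depth test starts at `initial_depth`."""
    nearest, shaded = initial_depth, 0
    for depth in fragments:
        if depth <= nearest:  # fragment passes the depth test and is shaded
            nearest = depth
            shaded += 1
    return shaded


# Hypothetical fragments in submission order (back-to-front); only 0.2 is visible.
fragments = [0.9, 0.5, 0.2]
previous_final_depth = 0.2  # what the previous frame's Z-Buffer held for this pixel

conventional = shade_count(fragments, FAR)                # shades all three fragments
predicted = shade_count(fragments, previous_final_depth)  # shades only the visible one

# If the visible surface moved back to 0.25, the stale prediction rejects
# everything; a real design needs a recovery path that this sketch omits.
mispredicted = shade_count([0.9, 0.5, 0.25], previous_final_depth)
print(conventional, predicted, mispredicted)
```

The first two results show two overshaded fragments being avoided when the frames are coherent; the third shows why a prediction taken from an old Z-Buffer cannot be trusted blindly.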

    Metodología de síntesis para uso de bloques DSP con HDL sobre FPGAS [Synthesis methodology for using DSP blocks with HDL on FPGAs]

    This work proposes a methodology for synthesizing HDL code so that it makes use of the DSP48E blocks available in the Xilinx Virtex 5 FPGA family. To achieve this, the original HDL code is modified so that the synthesis tool can recognize the parts of the code that must be implemented in the DSP blocks. We first tried to reach this goal using HDL code constructs from which XST, the Xilinx synthesis tool, infers DSP blocks. Since certain specific DSP configurations could not be obtained this way, we consider using the DSP48E-specific macro template, which allows these blocks to be instantiated directly. This requires a methodology for replacing the most common arithmetic and logic operations with their equivalents mapped onto a DSP48E block. The proposed methodology applies code transformations that preserve the original functionality of the design and limit the use of DSP48E blocks. Experimental results show that the designs obtained with XST after applying the methodology use fewer DSPs than those obtained by letting XST infer the DSPs automatically, while also reducing area and increasing the frequency of the design.

    Algoritmos de Triggering para detección de eventos y su aplicación para detección de Dust devils sobre FPGAS [Triggering algorithms for event detection and their application to dust devil detection on FPGAs]

    This work studies triggering algorithms. These algorithms have many applications in the prediction of geological and meteorological events such as earthquakes, tornadoes and volcanic eruptions. Specifically, we study how to apply them to the detection and prediction of dust devils on Mars: whirlwinds that form because of a strong contrast between the temperature of the surface and that of the atmosphere. Relying on data from the Mars Pathfinder mission, we study these phenomena, which travel across the Martian surface very frequently. Our final goal is to find out whether triggering algorithms applied to dust devil detection run more efficiently as software on a conventional processor or on a specific hardware system built with reconfigurable hardware. Finally, by comparing execution times, we determine which approach suits triggering algorithms better.
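The abstract does not say which triggering algorithm is used. As an illustration only, the sketch below implements STA/LTA (short-term average over long-term average), a widely used trigger in seismology, and should not be read as this work's method. The window lengths, threshold and synthetic signal are assumptions chosen for the example.

```python
def sta_lta_triggers(signal, short=3, long=10, threshold=2.0):
    """Return the sample indices at which the STA/LTA ratio reaches `threshold`.

    Both averages are taken over windows that end at the current sample,
    using the absolute value of the signal.
    """
    triggers = []
    for end in range(long, len(signal) + 1):
        sta = sum(abs(v) for v in signal[end - short:end]) / short
        lta = sum(abs(v) for v in signal[end - long:end]) / long
        if lta > 0 and sta / lta >= threshold:
            triggers.append(end - 1)  # index of the newest sample in the window
    return triggers


# Synthetic signal: a flat background with a three-sample burst at indices 15..17.
signal = [1.0] * 15 + [10.0] * 3 + [1.0] * 7
triggers = sta_lta_triggers(signal)
print(triggers)
```

With these parameters the trigger fires on the three burst samples and stays silent on the background, since the ratio sits at 1 whenever both windows see the same level.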

    Ultra-low power render based collision detection for CPU/GPU systems

    Smartphones have become powerful computing systems able to carry out complex tasks such as web browsing, image processing and gaming. Graphics animation applications such as 3D games represent a large percentage of the applications downloaded to mobile devices, and the trend is towards more complex and realistic scenes with accurate 3D physics simulations, like those on laptops and desktops. Collision detection (CD) is one of the main algorithms used in any physics kernel. However, highly accurate real-time CD is very expensive in terms of energy consumption, a parameter of paramount importance for mobile devices because it directly affects the autonomy of the system. In this work, we propose an energy-efficient, high-fidelity CD scheme that leverages some intermediate results of the rendering pipeline. It also adds a new, simple hardware block to the GPU pipeline that works in parallel with it and completes the remaining parts of the CD task with extremely low power consumption and higher speed than traditional schemes. Using commercial Android applications, we show that our scheme reduces the energy consumption of CD by 99.8% (i.e., 448x smaller) on average. Furthermore, the execution time required for CD in our scheme is almost three orders of magnitude smaller (600x speedup) than the time required by a conventional technique executed on a CPU. These dramatic benefits are accompanied by a higher-fidelity CD analysis (i.e., with finer granularity), which improves the quality and realism of the application.
    Peer reviewed.
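The abstract does not state which intermediate rendering results the scheme reuses. One common image-based formulation, assumed here purely for illustration, treats each object at each covered pixel as a depth interval from its front face to its back face and reports a collision when two intervals overlap at the same pixel. The object names, pixels and depths below are made up.

```python
def collides(obj_a, obj_b):
    """Return True if two objects overlap in depth at any pixel they both cover.

    Each object maps a pixel (x, y) to its (front_depth, back_depth) interval.
    """
    for pixel in obj_a.keys() & obj_b.keys():
        a_front, a_back = obj_a[pixel]
        b_front, b_back = obj_b[pixel]
        if a_front <= b_back and b_front <= a_back:  # closed intervals overlap
            return True
    return False


# Hypothetical rasterized depth intervals.
crate = {(3, 4): (0.40, 0.60), (4, 4): (0.40, 0.60)}
ball_touching = {(4, 4): (0.55, 0.70)}  # shares a pixel and overlaps in depth
ball_behind = {(4, 4): (0.65, 0.80)}    # shares a pixel but lies farther away
ball_aside = {(9, 9): (0.50, 0.55)}     # overlaps in depth but covers no shared pixel

print(collides(crate, ball_touching),
      collides(crate, ball_behind),
      collides(crate, ball_aside))
```

Because the check runs per pixel, its granularity follows the rendering resolution, which is one way to read the finer-grained analysis the abstract mentions.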

    Early visibility resolution for removing ineffectual computations in the graphics pipeline

    The main workload of GPUs is real-time image rendering. These applications take a description of an (animated) scene and produce the corresponding image(s). An image is rendered by computing the colors of all its pixels, and multiple objects commonly overlap at each pixel. Consequently, a significant amount of processing is devoted to objects that will not be visible in the final image, despite the widespread use in modern GPUs of the Early Depth Test, which attempts to discard computations related to occluded objects. Since animations are created from a sequence of similar images, visibility usually does not change much across consecutive frames. Based on this observation, we present Early Visibility Resolution (EVR), a mechanism that leverages the visibility information obtained in one frame to predict the visibility in the following one. Our proposal speculatively determines visibility much earlier in the pipeline than the Early Depth Test. We leverage this early visibility estimation to remove ineffectual computations at two different granularities: pixel level and tile level. Results show that these optimizations lead to a 39% performance improvement and 43% energy savings for a set of commercial Android graphics applications running on state-of-the-art mobile GPUs.
    Peer reviewed. Postprint (published version).

    Rendering elimination: early discard of redundant tiles in the graphics pipeline

    GPUs are among the most energy-consuming components in real-time rendering applications, since a large number of fragment shading computations and memory accesses are involved. Main memory bandwidth is especially taxing for battery-operated devices such as smartphones. Tile-Based Rendering GPUs divide the screen space into multiple tiles that are rendered independently in on-chip buffers, thus reducing memory bandwidth and energy consumption. We have observed that, in many animated graphics workloads, a large number of screen tiles have the same color across adjacent frames. In this paper, we propose Rendering Elimination (RE), a novel micro-architectural technique that, before rasterization, accurately determines whether a tile will be identical to the same tile in the preceding frame by comparing signatures. Since RE identifies redundant tiles early in the graphics pipeline, it completely avoids the computation and memory accesses of the most power-consuming stages of the pipeline, which substantially reduces the execution time and energy consumption of the GPU. For widely used Android applications, we show that RE achieves an average speedup of 1.74x and an energy reduction of 43% for the GPU/Memory system, far surpassing the benefits of Transaction Elimination, a state-of-the-art memory bandwidth reduction technique available in some commercial Tile-Based Rendering GPUs.
    Peer reviewed. Postprint (published version).
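The signature comparison can be sketched in Python as follows. What the hardware signature actually covers is not stated in the abstract; the sketch assumes it is a CRC32 over whatever inputs affect a tile (here, made-up triangle records), and the tile identifiers and helper names are hypothetical.

```python
import zlib


def tile_signature(tile_inputs):
    """Fold every input that affects a tile into one CRC32 value."""
    signature = 0
    for item in tile_inputs:
        signature = zlib.crc32(repr(item).encode("utf-8"), signature)
    return signature


def tiles_to_render(previous_signatures, frame_inputs):
    """Return the tiles whose signature changed, plus this frame's signatures."""
    render, signatures = [], {}
    for tile_id, inputs in frame_inputs.items():
        signatures[tile_id] = tile_signature(inputs)
        if previous_signatures.get(tile_id) != signatures[tile_id]:
            render.append(tile_id)  # changed or never seen: must be rendered
    return render, signatures


# Two hypothetical frames; only tile (0, 1) changes between them.
frame0 = {(0, 0): [("tri", (0, 0), (1, 0), (0, 1), "red")],
          (0, 1): [("tri", (2, 0), (3, 0), (2, 1), "blue")]}
frame1 = {(0, 0): [("tri", (0, 0), (1, 0), (0, 1), "red")],
          (0, 1): [("tri", (2, 0), (3, 0), (2, 1), "gold")]}

first, signatures0 = tiles_to_render({}, frame0)
second, _ = tiles_to_render(signatures0, frame1)
print(first, second)
```

A hash can in principle collide, so equal signatures are evidence of identical inputs rather than proof; a wider signature makes a false match less likely.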
